Midterm Project: Effects of Air Pollution on Countries
1. Introduction
Motivation
Air pollution affects blah blah blah… and with the increased worsening in climate and air quality, blah blah blah…. Our group wanted to explore how air pollution has changed over time and how it affects countries differently. Specifically, we wanted to analyze whether a country’s economic and social position increases, decreases, or has no observable impact on the effects of air pollution. In layman’s terms, does air pollution affect underdeveloped countries disproportionately?
Set Up
Before we start, we need to ensure that we have all the relevant libraries installed and imported.
Run these in the console to install packages in addition to the ezids package.
install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
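Before installing, it can help to see which of these packages are already present. The snippet below is a minimal sketch using base R only; the `pkgs` vector simply repeats the install list above, and the install call is left commented out so nothing is downloaded by accident.

```r
# Packages used in this report (copied from the install list above).
pkgs <- c("tidyverse", "rworldmap", "tmap", "spData", "sf", "ggpubr", "dplyr")

# Names of packages from the list that are not installed on this machine.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Show what is missing; uncomment the last line to install only those.
print(missing_pkgs)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```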
2. Data Sources and Data Wrangling
Data Sources
For our analysis, we will be working with 5 main data sources shown in the table below:
| Data | Source | Link |
|---|---|---|
| Deaths Due to Air Pollution of Countries from 1990 - 2017 | Kaggle | Link |
| GDP Annual Growth of Countries from 1960 - 2020 | Kaggle via WorldBank | Link |
| United Nations Population and Region Data | United Nations | Link |
| United Nations ISO-alpha3 code | United Nations | Link |
| spData for Map Geometries | spData for Mapping | Link |
The main variables in our datasets will include:
| Feature | Data Type | Unit of Measure | Notes and Assumptions |
|---|---|---|---|
| GDP (Gross Domestic Product) | Numerical, Continuous | $USD | This is our chosen proxy for measuring a country’s economic status |
| Population Size | Numerical, Continuous | thousands of people | Annual UN estimate |
| Deaths due to Air Pollution | Numerical, Continuous | deaths per 100,000 people | This is our chosen proxy for measuring the negative effects of air pollution. |
| Country | Qualitative, Categorical | N/A | 231 countries |
| SDG Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Region Classification. |
| Sub Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Sub-Region Classification. |
| ISO-alpha3 Country Code | Qualitative, Categorical | N/A | Standard for identifying countries (text ID). |
| ISO-alpha2 Country Code | Qualitative, Categorical | N/A | Another standard for identifying countries (text ID). |
| M49 Country Code | Numerical, Categorical | N/A | Another standard for identifying countries (numerical ID). |
| Year | Numerical, Categorical | N/A | 1990 to 2017 |
| GDP per Capita | Numerical, Continuous | $USD per person | Normalization of GDP to compare between population sizes (calculated). |
Data Wrangling
While the Kaggle data were already in a format ready for cleaning, the data downloaded from the United Nations required some wrangling first. Mainly, we needed to extract just the country-level data from the Excel workbooks into their own self-contained CSV files. Since we only needed to do this once, and writing code to select the specific cells would have taken significant time, we opted to perform this step in Excel rather than in R. If this were part of a production data pipeline, we would program the extraction, likely in a language such as Python, which has mature tooling for web scraping and for data transformations (for example, pandas).
- Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
- Figure 4: Sample screenshot of transformed UN dataset.
3. Load, Clean, and Inspect Data
Load Data
## 'data.frame': 249 obs. of 4 variables:
## $ Country.or.Area: chr "Andorra" "United Arab Emirates (the)" "Afghanistan" "Antigua and Barbuda" ...
## $ ISO.alpha2.code: chr "AD" "AE" "AF" "AG" ...
## $ ISO.alpha3.code: chr "AND" "ARE" "AFG" "ATG" ...
## $ M49.code : int 20 784 4 28 660 8 51 24 10 32 ...
## 'data.frame': 6468 obs. of 7 variables:
## $ Entity : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Code : chr "AFG" "AFG" "AFG" "AFG" ...
## $ Year : int 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ Air.pollution..total...deaths.per.100.000. : num 299 291 279 279 287 ...
## $ Indoor.air.pollution..deaths.per.100.000. : num 250 243 232 232 239 ...
## $ Outdoor.particulate.matter..deaths.per.100.000.: num 46.4 46 44.2 44.4 45.6 ...
## $ Outdoor.ozone.pollution..deaths.per.100.000. : num 5.62 5.6 5.61 5.66 5.72 ...
## 'data.frame': 264 obs. of 66 variables:
## $ Country.Name : chr "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ Country.Code : chr "ABW" "AFG" "AGO" "ALB" ...
## $ Indicator.Name: chr "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" "GDP (current US$)" ...
## $ Indicator.Code: chr "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" "NY.GDP.MKTP.CD" ...
## $ X1960 : num NA 537777811 NA NA NA ...
## $ X1961 : num NA 548888896 NA NA NA ...
## $ X1962 : num NA 546666678 NA NA NA ...
## $ X1963 : num NA 751111191 NA NA NA ...
## $ X1964 : num NA 800000044 NA NA NA ...
## $ X1965 : num NA 1006666638 NA NA NA ...
## $ X1966 : num NA 1399999967 NA NA NA ...
## $ X1967 : num NA 1673333418 NA NA NA ...
## $ X1968 : num NA 1373333367 NA NA NA ...
## $ X1969 : num NA 1408888922 NA NA NA ...
## $ X1970 : num NA 1748886596 NA NA 78619206 ...
## $ X1971 : num NA 1831108971 NA NA 89409820 ...
## $ X1972 : num NA 1595555476 NA NA 113408232 ...
## $ X1973 : num NA 1733333264 NA NA 150820103 ...
## $ X1974 : num NA 2155555498 NA NA 186558696 ...
## $ X1975 : num NA 2366666616 NA NA 220127246 ...
## $ X1976 : num NA 2555555567 NA NA 227281025 ...
## $ X1977 : num NA 2953333418 NA NA 254020153 ...
## $ X1978 : num NA 3300000109 NA NA 308008898 ...
## $ X1979 : num NA 3697940410 NA NA 411578334 ...
## $ X1980 : num NA 3641723322 5930503401 NA 446416106 ...
## $ X1981 : num NA 3478787909 5550483036 NA 388958731 ...
## $ X1982 : num NA NA 5550483036 NA 375895956 ...
## $ X1983 : num NA NA 5784341596 NA 327861833 ...
## $ X1984 : num NA NA 6131475065 1857338012 330070689 ...
## $ X1985 : num NA NA 7553560459 1897050133 346737965 ...
## $ X1986 : num 405463417 NA 7072063345 2097326250 482000594 ...
## $ X1987 : num 487602458 NA 8083872012 2080796250 611316399 ...
## $ X1988 : num 596423607 NA 8769250550 2051236250 721425939 ...
## $ X1989 : num 695304363 NA 10201099040 2253090000 795449332 ...
## $ X1990 : num 764887117 NA 11228764963 2028553750 1029048482 ...
## $ X1991 : num 872138715 NA 10603784541 1099559028 1106928583 ...
## $ X1992 : num 958463184 NA 8307810974 652174991 1210013652 ...
## $ X1993 : num 1082979721 NA 5768720422 1185315468 1007025755 ...
## $ X1994 : num 1245688268 NA 4438321017 1880951520 1017549124 ...
## $ X1995 : num 1320474860 NA 5538749260 2392764853 1178738991 ...
## $ X1996 : num 1379960894 NA 7526446606 3199642580 1223945357 ...
## $ X1997 : num 1531944134 NA 7648377413 2258515610 1180597273 ...
## $ X1998 : num 1665100559 NA 6506229607 2545967253 1211932398 ...
## $ X1999 : num 1722798883 NA 6152922943 3212119044 1239876305 ...
## $ X2000 : num 1873452514 NA 9129594819 3480355189 1429049198 ...
## $ X2001 : num 1920111732 NA 8936063723 3922099471 1546926174 ...
## $ X2002 : num 1941340782 4055179566 15285594828 4348070165 1755910032 ...
## $ X2003 : num 2021229050 4515558808 17812705294 5611492283 2361726862 ...
## $ X2004 : num 2228491620 5226778809 23552052408 7184681399 2894921778 ...
## $ X2005 : num 2330726257 6209137625 36970918699 8052075642 3159905484 ...
## $ X2006 : num 2424581006 6971285595 52381006892 8896073938 3456442103 ...
## $ X2007 : num 2615083799 9747879532 65266452081 10677321490 3952600602 ...
## $ X2008 : num 2745251397 10109225814 88538611205 12881354104 4085630584 ...
## $ X2009 : num 2498882682 12439087077 70307163678 12044223353 3674409558 ...
## $ X2010 : num 2390502793 15856574731 83799496611 11926928506 3449966857 ...
## $ X2011 : num 2549720670 17804292964 111789686464 12890765324 3629203786 ...
## $ X2012 : num 2534636872 20001598506 128052853643 12319830252 3188808943 ...
## $ X2013 : num 2701675978 20561069558 136709862831 12776217195 3193704343 ...
## $ X2014 : num 2765363128 20484885120 145712200313 13228144008 3271808157 ...
## $ X2015 : num 2919553073 19907111419 116193649124 11386846319 2789870188 ...
## $ X2016 : num 2965921788 18017749074 101123851090 11861200797 2896679212 ...
## $ X2017 : num 3056424581 18869945678 122123822334 13019693451 3000180750 ...
## $ X2018 : num NA 18353881130 101353230785 15147020535 3218316013 ...
## $ X2019 : num NA 19291104008 88815697793 15279183290 3154057987 ...
## $ X2020 : logi NA NA NA NA NA NA ...
## $ X : logi NA NA NA NA NA NA ...
## 'data.frame': 235 obs. of 78 variables:
## $ SDGRegion : chr "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" "SUB-SAHARAN AFRICA" ...
## $ SubRegion : chr "Eastern Africa" "Eastern Africa" "Eastern Africa" "Eastern Africa" ...
## $ Country : chr "Burundi" "Comoros" "Djibouti" "Eritrea" ...
## $ Notes : int NA NA NA NA NA NA NA NA 1 2 ...
## $ Country.code: int 108 174 262 232 231 404 450 454 480 175 ...
## $ Type : chr "Country/Area" "Country/Area" "Country/Area" "Country/Area" ...
## $ Parent.code : int 910 910 910 910 910 910 910 910 910 910 ...
## $ X1950 : chr " 2 309" " 159" " 62" " 822" ...
## $ X1951 : chr " 2 360" " 163" " 63" " 835" ...
## $ X1952 : chr " 2 406" " 167" " 65" " 849" ...
## $ X1953 : chr " 2 449" " 170" " 66" " 865" ...
## $ X1954 : chr " 2 492" " 173" " 68" " 882" ...
## $ X1955 : chr " 2 537" " 176" " 70" " 900" ...
## $ X1956 : chr " 2 585" " 179" " 71" " 919" ...
## $ X1957 : chr " 2 636" " 182" " 74" " 939" ...
## $ X1958 : chr " 2 689" " 185" " 76" " 961" ...
## $ X1959 : chr " 2 743" " 188" " 80" " 983" ...
## $ X1960 : chr " 2 798" " 191" " 84" " 1 008" ...
## $ X1961 : chr " 2 852" " 194" " 89" " 1 033" ...
## $ X1962 : chr " 2 907" " 197" " 94" " 1 060" ...
## $ X1963 : chr " 2 964" " 200" " 101" " 1 089" ...
## $ X1964 : chr " 3 026" " 204" " 108" " 1 118" ...
## $ X1965 : chr " 3 094" " 207" " 115" " 1 148" ...
## $ X1966 : chr " 3 170" " 211" " 123" " 1 179" ...
## $ X1967 : chr " 3 253" " 216" " 131" " 1 210" ...
## $ X1968 : chr " 3 337" " 221" " 140" " 1 243" ...
## $ X1969 : chr " 3 414" " 225" " 150" " 1 276" ...
## $ X1970 : chr " 3 479" " 230" " 160" " 1 311" ...
## $ X1971 : chr " 3 530" " 235" " 169" " 1 347" ...
## $ X1972 : chr " 3 570" " 239" " 179" " 1 385" ...
## $ X1973 : chr " 3 605" " 244" " 191" " 1 424" ...
## $ X1974 : chr " 3 646" " 250" " 205" " 1 464" ...
## $ X1975 : chr " 3 701" " 257" " 224" " 1 505" ...
## $ X1976 : chr " 3 771" " 266" " 249" " 1 548" ...
## $ X1977 : chr " 3 854" " 276" " 277" " 1 592" ...
## $ X1978 : chr " 3 949" " 287" " 308" " 1 637" ...
## $ X1979 : chr " 4 051" " 297" " 336" " 1 684" ...
## $ X1980 : chr " 4 157" " 308" " 359" " 1 733" ...
## $ X1981 : chr " 4 267" " 318" " 375" " 1 785" ...
## $ X1982 : chr " 4 380" " 327" " 385" " 1 837" ...
## $ X1983 : chr " 4 498" " 336" " 394" " 1 891" ...
## $ X1984 : chr " 4 621" " 345" " 406" " 1 946" ...
## $ X1985 : chr " 4 751" " 355" " 426" " 2 004" ...
## $ X1986 : chr " 4 887" " 366" " 454" " 2 065" ...
## $ X1987 : chr " 5 027" " 377" " 490" " 2 127" ...
## $ X1988 : chr " 5 169" " 388" " 529" " 2 186" ...
## $ X1989 : chr " 5 307" " 400" " 564" " 2 231" ...
## $ X1990 : chr " 5 439" " 412" " 590" " 2 259" ...
## $ X1991 : chr " 5 565" " 424" " 607" " 2 266" ...
## $ X1992 : chr " 5 686" " 436" " 615" " 2 258" ...
## $ X1993 : chr " 5 798" " 449" " 619" " 2 239" ...
## $ X1994 : chr " 5 899" " 462" " 622" " 2 218" ...
## $ X1995 : chr " 5 987" " 475" " 630" " 2 204" ...
## $ X1996 : chr " 6 060" " 489" " 644" " 2 196" ...
## $ X1997 : chr " 6 122" " 502" " 661" " 2 195" ...
## $ X1998 : chr " 6 186" " 515" " 680" " 2 206" ...
## $ X1999 : chr " 6 267" " 529" " 700" " 2 237" ...
## $ X2000 : chr " 6 379" " 542" " 718" " 2 292" ...
## $ X2001 : chr " 6 526" " 556" " 733" " 2 375" ...
## $ X2002 : chr " 6 704" " 569" " 747" " 2 481" ...
## $ X2003 : chr " 6 909" " 583" " 760" " 2 601" ...
## $ X2004 : chr " 7 132" " 597" " 772" " 2 720" ...
## $ X2005 : chr " 7 365" " 612" " 783" " 2 827" ...
## $ X2006 : chr " 7 608" " 626" " 795" " 2 918" ...
## $ X2007 : chr " 7 862" " 642" " 805" " 2 997" ...
## $ X2008 : chr " 8 126" " 657" " 816" " 3 063" ...
## $ X2009 : chr " 8 398" " 673" " 828" " 3 120" ...
## $ X2010 : chr " 8 676" " 690" " 840" " 3 170" ...
## $ X2011 : chr " 8 958" " 707" " 854" " 3 214" ...
## $ X2012 : chr " 9 246" " 724" " 868" " 3 250" ...
## $ X2013 : chr " 9 540" " 742" " 883" " 3 281" ...
## $ X2014 : chr " 9 844" " 759" " 899" " 3 311" ...
## $ X2015 : chr " 10 160" " 777" " 914" " 3 343" ...
## $ X2016 : chr " 10 488" " 796" " 929" " 3 377" ...
## $ X2017 : chr " 10 827" " 814" " 944" " 3 413" ...
## $ X2018 : chr " 11 175" " 832" " 959" " 3 453" ...
## $ X2019 : chr " 11 531" " 851" " 974" " 3 497" ...
## $ X2020 : chr " 11 891" " 870" " 988" " 3 546" ...
## tibble [177 × 11] (S3: sf/tbl_df/tbl/data.frame)
## $ iso_a2 : chr [1:177] "FJ" "TZ" "EH" "CA" ...
## $ name_long: chr [1:177] "Fiji" "Tanzania" "Western Sahara" "Canada" ...
## $ continent: chr [1:177] "Oceania" "Africa" "Africa" "North America" ...
## $ region_un: chr [1:177] "Oceania" "Africa" "Africa" "Americas" ...
## $ subregion: chr [1:177] "Melanesia" "Eastern Africa" "Northern Africa" "Northern America" ...
## $ type : chr [1:177] "Sovereign country" "Sovereign country" "Indeterminate" "Sovereign country" ...
## $ area_km2 : num [1:177] 19290 932746 96271 10036043 9510744 ...
## $ pop : num [1:177] 885806 52234869 NA 35535348 318622525 ...
## $ lifeExp : num [1:177] 70 64.2 NA 82 78.8 ...
## $ gdpPercap: num [1:177] 8222 2402 NA 43079 51922 ...
## $ geom :sfc_MULTIPOLYGON of length 177; first list element: List of 3
## ..$ :List of 1
## .. ..$ : num [1:5, 1:2] -180 -180 -180 -180 -180 ...
## ..$ :List of 1
## .. ..$ : num [1:9, 1:2] 178 178 177 177 178 ...
## ..$ :List of 1
## .. ..$ : num [1:8, 1:2] 180 180 179 179 179 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geom"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA
## ..- attr(*, "names")= chr [1:10] "iso_a2" "name_long" "continent" "region_un" ...
Clean Data
First, we need to drop unnecessary columns and set the data types (factor, num, etc.).
Clean air_pollution_df:
## 'data.frame': 6468 obs. of 4 variables:
## $ Country : Factor w/ 231 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ISO.alpha3.code : Factor w/ 197 levels "","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Deaths.Air.Pollution.per.100k: num 299 291 279 279 287 ...
Clean gdp_df:
## tibble [12,401 × 4] (S3: tbl_df/tbl/data.frame)
## $ Country : Factor w/ 259 levels "Afghanistan",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ ISO.alpha3.code: Factor w/ 259 levels "ABW","AFG","AGO",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 60 levels "1960","1961",..: 27 28 29 30 31 32 33 34 35 36 ...
## $ GDP.USD : num [1:12401] 405463417 487602458 596423607 695304363 764887117 ...
Clean population_region_df:
## 'data.frame': 16685 obs. of 6 variables:
## $ SDGRegion : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SubRegion : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Country : Factor w/ 235 levels "Afghanistan",..: 34 34 34 34 34 34 34 34 34 34 ...
## $ M49.code : Factor w/ 235 levels "100","104","108",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Year : Factor w/ 71 levels "1950","1951",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Population.thousands: num 2309 2360 2406 2449 2492 ...
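The raw UN population table is wide (one column per year) and stores numbers as strings with a space as the thousands separator (for example " 2 309"). The sketch below shows one way to get from that shape to the long, numeric form printed above. It uses base R on a small toy table, so the data frame `pop_wide` and the helper column `raw` are illustrative assumptions rather than the code used for this report. If the real file uses non-breaking spaces, the pattern may need to be adjusted.

```r
# Toy wide table mimicking the UN layout (two countries, two years).
pop_wide <- data.frame(
  Country = c("Burundi", "Comoros"),
  X1950   = c(" 2 309", " 159"),
  X1951   = c(" 2 360", " 163"),
  stringsAsFactors = FALSE
)

# Year columns are the ones named X followed by four digits.
year_cols <- grep("^X[0-9]{4}$", names(pop_wide), value = TRUE)

# Stack the year columns: one row per country-year.
pop_long <- data.frame(
  Country = rep(pop_wide$Country, times = length(year_cols)),
  Year    = rep(sub("^X", "", year_cols), each = nrow(pop_wide)),
  raw     = unlist(pop_wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Strip all whitespace, then convert the strings to numbers.
pop_long$Population.thousands <- as.numeric(gsub("[[:space:]]", "", pop_long$raw))
pop_long$Year <- factor(pop_long$Year)
pop_long$raw  <- NULL

print(pop_long)
```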
Clean the country code data frame:
## 'data.frame': 249 obs. of 4 variables:
## $ Country.or.Area: Factor w/ 249 levels "Afghanistan",..: 6 234 1 10 8 3 12 7 9 11 ...
## $ ISO.alpha2.code: Factor w/ 248 levels "AD","AE","AF",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ ISO.alpha3.code: Factor w/ 249 levels "ABW","AFG","AGO",..: 7 8 2 14 4 6 10 3 12 9 ...
## $ M49.code : Factor w/ 249 levels "4","8","10","12",..: 6 225 1 8 190 2 16 7 3 10 ...
Clean world:
## tibble [177 × 2] (S3: sf/tbl_df/tbl/data.frame)
## $ iso_a2: Factor w/ 175 levels "AE","AF","AL",..: 53 162 48 26 165 89 167 124 72 7 ...
## $ geom :sfc_MULTIPOLYGON of length 177; first list element: List of 3
## ..$ :List of 1
## .. ..$ : num [1:5, 1:2] -180 -180 -180 -180 -180 ...
## ..$ :List of 1
## .. ..$ : num [1:9, 1:2] 178 178 177 177 178 ...
## ..$ :List of 1
## .. ..$ : num [1:8, 1:2] 180 180 179 179 179 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geom"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
## ..- attr(*, "names")= chr "iso_a2"
Note that we only have geometries for 175 countries, so some countries cannot be plotted on a map, but that is acceptable for our analysis.
Final DataFrame Construction
Now let’s merge our datasets into one through a series of inner joins, using country code and year as keys depending on the specific join. We use inner joins because we want to drop rows that would otherwise contain null values, which arise when a country has no matching country code or when a dataset covers more years than our smallest year range (the air pollution dataset, 1990 to 2017).
## 'data.frame': 5197 obs. of 12 variables:
## $ ISO.alpha2.code : Factor w/ 248 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ M49.code : Factor w/ 249 levels "4","8","10","12",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Year : Factor w/ 28 levels "1990","1991",..: 23 24 1 2 3 4 5 6 7 8 ...
## $ ISO.alpha3.code : Factor w/ 197 levels "","AFG","AGO",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Country.x : Factor w/ 231 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Deaths.Air.Pollution.per.100k: num 17.7 17.2 29 28.7 28.5 ...
## $ GDP.USD : num 3188808943 3193704343 1029048482 1106928583 1210013652 ...
## $ SDGRegion : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SubRegion : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ Population.thousands : num 82 81 55 57 59 61 63 64 64 64 ...
## $ geom :sfc_MULTIPOLYGON of length 5197; first list element: list()
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## $ gdp.per.capita : num 38887914 39428449 18709972 19419800 20508706 ...
Our dataset is finally ready to be analyzed.
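The join and the per-capita calculation can be sketched on toy data as follows. This is a base R illustration under assumptions: the three small data frames are made up to mirror the cleaned column names (the Afghanistan row is hypothetical), and all of them are keyed by ISO.alpha3.code for simplicity. One detail worth checking: the gdp.per.capita values printed above equal GDP.USD divided by Population.thousands (3188808943 / 82 ≈ 38887914), which is USD per thousand people. Multiplying the population by 1,000 first gives USD per person.

```r
# Toy inputs; column names follow the cleaned data frames above.
air <- data.frame(
  ISO.alpha3.code = c("AND", "AND", "AFG"),
  Year = c("2012", "2013", "2012"),
  Deaths.Air.Pollution.per.100k = c(17.7, 17.2, 250)  # AFG value is made up
)
gdp <- data.frame(
  ISO.alpha3.code = c("AND", "AND"),
  Year = c("2012", "2013"),
  GDP.USD = c(3188808943, 3193704343)
)
pop <- data.frame(
  ISO.alpha3.code = c("AND", "AND"),
  Year = c("2012", "2013"),
  Population.thousands = c(82, 81)
)

# merge() performs an inner join by default: rows without a match are dropped.
keys   <- c("ISO.alpha3.code", "Year")
joined <- merge(air, gdp, by = keys)
joined <- merge(joined, pop, by = keys)

# Population is stored in thousands, so scale it to people before dividing.
joined$gdp.per.capita <- joined$GDP.USD / (joined$Population.thousands * 1000)

print(joined)
```

In this toy example the Afghanistan row disappears because it has no GDP or population match, which is the behavior the inner joins rely on.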
4. EDA - Exploratory Data Analysis
Quick Plots
Let’s start our EDA with some quick plots of the distribution of the data.
Histogram of Air Pollution Induced Deaths, Population, and GDP per Capita
- Figures 5, 6, 7: Histograms of Air Pollution Induced Deaths, Population, and GDP per Capita.
The distributions of Deaths.Air.Pollution.per.100k, Population.thousands, and gdp.per.capita are not normal; all three are right-skewed.
Boxplot of Air Pollution Induced Deaths, Population, and GDP per Capita
Let’s look at boxplots to check for outliers.
- Figure 8: Boxplot of Deaths per 100,000 from Air Pollution vs SDG Region
It is interesting to note that Australia/New Zealand, Europe, and North America seem to have the lowest deaths per 100k from air pollution and are fairly tightly clustered (low variance) relative to other regions around the world. Furthermore, these regions contain many of the most economically developed countries.
Let’s take another look but at SubRegions.
- Figure 9: Boxplot of Deaths per 100k from Air Pollution vs Sub Region
Separating the data into an even more granular grouping of regions shows that Australia/New Zealand, North America, Northern Europe, and Western Europe all have low deaths per 100k with low variance. Historically, these regions consist of countries that were considered ‘First World’ before 1990, the first year of our analysis. We will dig into this more later in our SMART questions.
What does the GDP per capita of these regions look like comparatively? Let’s take a look.
- Figure 10: Boxplot of GDP per Capita vs Sub Region
Interesting to observe that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively. We will try to see if we can quantify this relationship later on in our main research analysis.
Map of Countries
Plotting maps and maps with intensities will be useful for us to visualize our data and the results of our analysis.
- Figure 11: Global Map of SDGRegions and SubRegions
- Figure 12: Global Intensity Map of Key Numerical Features, 1990 to 2017
There appears to be an inverse correlation between gdp.per.capita and Deaths.Air.Pollution.per.100k.
We can also use ggplot2 to have a bit more control over map plotting.
- Figure 13: Global Intensity Map of Deaths due to Air Pollution per 100k People, 1990 to 2017
- Figure 14: Intensity Map of Deaths due to Air Pollution per 100k People in East and Southeastern Asia, 2017
SMART Questions
1. Is there a relationship between population size and Deaths per 100,000 due to air pollution?
Below, we measure the relationship between population size (in thousands) and deaths per 100,000 due to air pollution. Since both variables are numerical, we first check whether they are normally distributed using Q-Q plots. Because neither is, we use the Spearman rank correlation for this pair. From the results below, we see essentially no correlation between a country’s population size and its deaths due to air pollution (Spearman coefficient of 0.037). We do, however, observe a negative correlation between deaths due to air pollution and GDP per capita (-0.543 in the correlation matrix).
str(final_df)
## 'data.frame': 5197 obs. of 12 variables:
## $ ISO.alpha2.code : Factor w/ 248 levels "AD","AE","AF",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ M49.code : Factor w/ 249 levels "4","8","10","12",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Year : Factor w/ 28 levels "1990","1991",..: 23 24 1 2 3 4 5 6 7 8 ...
## $ ISO.alpha3.code : Factor w/ 197 levels "","AFG","AGO",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Country.x : Factor w/ 231 levels "Afghanistan",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Deaths.Air.Pollution.per.100k: num 17.7 17.2 29 28.7 28.5 ...
## $ GDP.USD : num 3188808943 3193704343 1029048482 1106928583 1210013652 ...
## $ SDGRegion : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SubRegion : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ Population.thousands : num 82 81 55 57 59 61 63 64 64 64 ...
## $ geom :sfc_MULTIPOLYGON of length 5197; first list element: list()
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## $ gdp.per.capita : num 38887914 39428449 18709972 19419800 20508706 ...
# check normality with Q-Q plots
qqnorm(final_df$Population.thousands)
qqnorm(final_df$Deaths.Air.Pollution.per.100k)
# rank-based correlation, since neither variable is normally distributed
cor(final_df$Population.thousands, final_df$Deaths.Air.Pollution.per.100k, method = "spearman")
## [1] 0.037
# correlation matrix (cor() uses Pearson by default)
pop_poll_cor <- cor(select(final_df, Deaths.Air.Pollution.per.100k, Population.thousands, gdp.per.capita))
pop_poll_cor
## Deaths.Air.Pollution.per.100k
## Deaths.Air.Pollution.per.100k 1.000
## Population.thousands 0.069
## gdp.per.capita -0.543
## Population.thousands gdp.per.capita
## Deaths.Air.Pollution.per.100k 0.069 -0.543
## Population.thousands 1.000 -0.040
## gdp.per.capita -0.040 1.000
xkabledply(pop_poll_cor)
| | Deaths.Air.Pollution.per.100k | Population.thousands | gdp.per.capita |
|---|---|---|---|
| Deaths.Air.Pollution.per.100k | 1.000 | 0.069 | -0.543 |
| Population.thousands | 0.069 | 1.000 | -0.040 |
| gdp.per.capita | -0.543 | -0.040 | 1.000 |
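Note that the single coefficient of 0.037 uses Spearman, while the matrix comes from cor() with its default method, Pearson. To make the difference concrete, here is a small sketch on simulated right-skewed data (not the report's data; the variables `x` and `y` are made up). It shows that Spearman is simply Pearson applied to ranks, which is why it is less sensitive to skew and outliers.

```r
set.seed(1)

# Simulated right-skewed variables with a monotone but non-linear relationship.
x <- rexp(200)
y <- exp(-x + rnorm(200, sd = 0.5))

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman's coefficient equals Pearson's coefficient computed on the ranks.
spearman_by_hand <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman, by_hand = spearman_by_hand))
```

Passing method = "spearman" to the matrix call as well would keep the two results comparable, if that is the intent.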
#plot
loadPkg("corrplot")
corrplot(pop_poll_cor)
3. Which regions have the lowest and highest deaths due to air pollution?
library(scales)
# Aggregate by year and region: sum of the per-country death rates (per 100k)
deathsbyyear_reg <- group_by(.data = final_df, Year, SDGRegion)
totdeath_reg <- summarize(.data = deathsbyyear_reg, total = sum(Deaths.Air.Pollution.per.100k, na.rm = TRUE))
str(totdeath_reg)
## tibble [252 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ Year : Factor w/ 28 levels "1990","1991",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ SDGRegion: Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 1 2 3 4 5 6 7 8 9 1 ...
## $ total : num [1:252] 50.5 1759.9 1290.9 1540.6 2272.3 ...
## - attr(*, "groups")= tibble [28 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Year : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
## ..$ .rows: list<int> [1:28]
## .. ..$ : int [1:9] 1 2 3 4 5 6 7 8 9
## .. ..$ : int [1:9] 10 11 12 13 14 15 16 17 18
## .. ..$ : int [1:9] 19 20 21 22 23 24 25 26 27
## .. ..$ : int [1:9] 28 29 30 31 32 33 34 35 36
## .. ..$ : int [1:9] 37 38 39 40 41 42 43 44 45
## .. ..$ : int [1:9] 46 47 48 49 50 51 52 53 54
## .. ..$ : int [1:9] 55 56 57 58 59 60 61 62 63
## .. ..$ : int [1:9] 64 65 66 67 68 69 70 71 72
## .. ..$ : int [1:9] 73 74 75 76 77 78 79 80 81
## .. ..$ : int [1:9] 82 83 84 85 86 87 88 89 90
## .. ..$ : int [1:9] 91 92 93 94 95 96 97 98 99
## .. ..$ : int [1:9] 100 101 102 103 104 105 106 107 108
## .. ..$ : int [1:9] 109 110 111 112 113 114 115 116 117
## .. ..$ : int [1:9] 118 119 120 121 122 123 124 125 126
## .. ..$ : int [1:9] 127 128 129 130 131 132 133 134 135
## .. ..$ : int [1:9] 136 137 138 139 140 141 142 143 144
## .. ..$ : int [1:9] 145 146 147 148 149 150 151 152 153
## .. ..$ : int [1:9] 154 155 156 157 158 159 160 161 162
## .. ..$ : int [1:9] 163 164 165 166 167 168 169 170 171
## .. ..$ : int [1:9] 172 173 174 175 176 177 178 179 180
## .. ..$ : int [1:9] 181 182 183 184 185 186 187 188 189
## .. ..$ : int [1:9] 190 191 192 193 194 195 196 197 198
## .. ..$ : int [1:9] 199 200 201 202 203 204 205 206 207
## .. ..$ : int [1:9] 208 209 210 211 212 213 214 215 216
## .. ..$ : int [1:9] 217 218 219 220 221 222 223 224 225
## .. ..$ : int [1:9] 226 227 228 229 230 231 232 233 234
## .. ..$ : int [1:9] 235 236 237 238 239 240 241 242 243
## .. ..$ : int [1:9] 244 245 246 247 248 249 250 251 252
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
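One thing to keep in mind with the aggregation above: it sums per-100k rates across countries, so the total grows with the number of countries in a region, not only with how severe air pollution is there. A possible alternative is a population-weighted regional rate. The sketch below illustrates the difference on made-up numbers in base R; the toy data frame and the helper column `deaths_est` are assumptions for illustration and are not part of the report's pipeline.

```r
# Toy data: region A has two countries, region B has one (numbers are made up).
toy <- data.frame(
  SDGRegion = c("A", "A", "B"),
  Year = c("1990", "1990", "1990"),
  Deaths.Air.Pollution.per.100k = c(100, 50, 80),
  Population.thousands = c(1000, 9000, 5000)
)

# Sum of rates, as in the aggregation above.
rate_sum <- aggregate(Deaths.Air.Pollution.per.100k ~ SDGRegion + Year,
                      data = toy, FUN = sum)

# Estimated death counts: rate per 100k * people / 100,000,
# where people = Population.thousands * 1000.
toy$deaths_est <- toy$Deaths.Air.Pollution.per.100k * toy$Population.thousands / 100

# Population-weighted regional rate per 100k.
agg <- aggregate(cbind(deaths_est, Population.thousands) ~ SDGRegion + Year,
                 data = toy, FUN = sum)
agg$weighted_rate <- agg$deaths_est / agg$Population.thousands * 100

print(rate_sum)
print(agg)
```

In this toy case the summed rate ranks region A (150) well above region B (80), while the weighted rate ranks A (55) below B (80), so the choice of aggregate can change the story a plot tells.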
deaths_line <- ggplot() +
geom_line(data = totdeath_reg, mapping = aes(x = Year, y = total, group = SDGRegion, color = SDGRegion), size = 1.2) +
geom_point(data = totdeath_reg, mapping = aes(x = Year, y = total, group = SDGRegion, color = SDGRegion), size = 1.2) +
scale_y_continuous(labels = comma, limits = c(0, 15000), breaks = seq(0, 15000, 1500))
# scale_x_discrete(limits = c(1990, 2017), breaks = seq(1990,2017,2))
deaths_line <- deaths_line + labs(title = "Deaths per 100,000 by Air Pollution, by Region",
subtitle = "1990 - 2017",
caption = "Data Source: Kaggle",
y = "Deaths due to Air Pollution",
x = "") +
theme_minimal() +
theme(axis.title = element_text(size = 8, face = "bold"),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.grid.major.y = element_blank(),
axis.line.x = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
axis.text = element_text(size = 10),
#legend.position = "none",
legend.text = element_text(size=5),
plot.subtitle = element_text(size = 8),
plot.title = element_text(size = 10, margin = margin(b = 10)))
deaths_line <- deaths_line + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))
deaths_line
4. How do deaths due to air pollution increase over time? More specifically, are death rates in the most recent X years higher than death rates in earlier groups of X years?
library(scales)
# Aggregate by year: sum of the per-country death rates (per 100k)
deathsbyyear <- group_by(.data = final_df, Year)
totdeath <- summarize(.data = deathsbyyear, tot_deaths = sum(Deaths.Air.Pollution.per.100k, na.rm = TRUE))
str(totdeath)
## tibble [28 × 2] (S3: tbl_df/tbl/data.frame)
## $ Year : Factor w/ 28 levels "1990","1991",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ tot_deaths: num [1:28] 18161 17803 17927 18133 18134 ...
deaths_line <- ggplot() +
geom_line(data = totdeath, mapping = aes(x = Year, y = tot_deaths, group = 1), color = "darkmagenta", size = 1.2) +
geom_point(data = totdeath, mapping = aes(x = Year, y = tot_deaths, group = 1), color = "darkmagenta", size = 1.2) +
scale_y_continuous(labels = comma, limits = c(0, 40000), breaks = seq(0, 40000, 10000))
# scale_x_discrete(limits = c(1990, 2017), breaks = seq(1990,2017,2))
deaths_line <- deaths_line + labs(title = "Deaths per 100,000 by Air Pollution",
subtitle = "1990 - 2017",
caption = "Data Source: Kaggle",
y = "Deaths due to Air Pollution",
x = "") +
theme_minimal() +
theme(axis.title = element_text(size = 8, face = "bold"),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.grid.major.y = element_blank(),
axis.line.x = element_line(color = "black"),
axis.ticks = element_line(color = "black"),
axis.text = element_text(size = 10),
legend.position = "none",
plot.subtitle = element_text(size = 8),
plot.title = element_text(size = 10, margin = margin(b = 10)))
deaths_line <- deaths_line + theme(plot.title = element_text(color = "black", size = 12, face = "bold", hjust = 0),
plot.subtitle = element_text(color = "black", size = 10, hjust = 0 ),
plot.caption = element_text(color = "black", size =8, face = "italic", hjust =0))
deaths_line
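The second part of the question, comparing the most recent X years with earlier groups of X years, can be sketched as follows. This is an illustration on a made-up yearly series, not the report's totals. The block size `X = 7` and the use of a one-sided Wilcoxon rank-sum test are assumptions chosen for the example, and consecutive years are not independent observations, so any p-value should be read with caution.

```r
# Made-up yearly series for 1990-2017 (a steady decline, for illustration only).
years      <- 1990:2017
tot_deaths <- seq(18000, 12600, length.out = length(years))

# Group years into consecutive blocks of X years, labeled by their first year.
X <- 7
block_start <- 1990 + X * ((years - 1990) %/% X)

# Mean of the series within each block.
block_means <- tapply(tot_deaths, block_start, mean)
print(block_means)

# Compare the most recent block against all earlier years.
recent  <- tot_deaths[block_start == max(block_start)]
earlier <- tot_deaths[block_start <  max(block_start)]

# One-sided test of whether recent values tend to be higher than earlier ones.
test <- wilcox.test(recent, earlier, alternative = "greater")
print(test$p.value)
```

With the real data, `tot_deaths` would be replaced by totdeath$tot_deaths and `years` by the numeric version of the Year factor.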
5. Main Research Question
Do lower GDP countries have more deaths per 100k due to air pollution?
Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?
Linear Fit
Let’s first look at the general fit on the overall data.
- Fig XX: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.
From the plot, we observe that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the relationship is not particularly strong: R² is low at 0.295, meaning that a linear relationship with GDP per capita explains only about 30% of the variance in deaths due to air pollution per 100k.
Looking at each individual SDGRegion, the linear fits improve overall but are still not particularly strong, with the highest being Australia/New Zealand and Europe at R² of 0.56 and 0.55 respectively.
- Fig XX: Linear models for each SDGRegion, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.
Let’s now look at how time plays a part.
- Fig XX: Linear models for each Year, deaths due to air pollution per 100k vs GDP per capita, 1990 - 2017.
As observed, time does not seem to play a significant part in describing the relationship between deaths due to air pollution per 100k and GDP per capita, as R² stays roughly constant at around 0.3 across all years.
Transformed Log Scale - Linear Fit
Perhaps we should look at a non-linear fit. From our visuals, we see that every plot starts off at very high deaths due to air pollution per 100k and then drops off dramatically as GDP per capita increases. However, the drop-off begins to taper off and asymptotically approaches some value. (It will be interesting to see if we can generalize what that GDP per capita value is. Let’s table that for later.) We have seen this type of behavior before in log graphs like the one shown below.
- Fig XX: Sample log graph.
Our data look more like -log(x) than log(x). Let’s wrap both variables in the log() function, fit a linear model to the transformed data, and see what the relationship is.
##
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita),
## data = final_df_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.099 -0.235 0.000 0.206 1.431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.07849 0.04871 207 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952 0.00323 -121 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.737
## F-statistic: 1.45e+04 on 1 and 5195 DF, p-value: <0.0000000000000002
- Fig XX, XX, XX: Fitting to a log(y) = (m)(log(x)) + b curve yields a much stronger relationship across the board.
Across the board, the strength of our linear relationship increases dramatically when both features are first transformed by the log() function. The new R2 is 0.737, which means around 74% of the variance in the log of our target feature can be explained by this relationship.
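To illustrate why the transformation helps, the sketch below fits both the untransformed and the log-log model to the same simulated data and compares their R2. The data frame `toy_df` and the names `fit_linear` and `fit_loglog` are assumptions for illustration; the formula mirrors the `lm()` call shown in the summary output above.

```r
# Simulated power-law data with multiplicative noise; NOT the report's data.
set.seed(3)
toy_df <- data.frame(gdp.per.capita = exp(runif(300, log(500), log(60000))))
toy_df$Deaths.Air.Pollution.per.100k <-
  exp(10 - 0.4 * log(toy_df$gdp.per.capita) + rnorm(300, sd = 0.3))

# Untransformed fit versus log-log fit (log() in R is the natural logarithm).
fit_linear <- lm(Deaths.Air.Pollution.per.100k ~ gdp.per.capita, data = toy_df)
fit_loglog <- lm(log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita),
                 data = toy_df)

r2_linear <- summary(fit_linear)$r.squared
r2_loglog <- summary(fit_loglog)$r.squared
loglog_slope <- unname(coef(fit_loglog)["log(gdp.per.capita)"])

cat("R-squared, linear: ", round(r2_linear, 3), "\n")
cat("R-squared, log-log:", round(r2_loglog, 3), "\n")
cat("log-log slope:     ", round(loglog_slope, 3), "\n")
```

Note that the two R2 values are measured on different scales (deaths versus log of deaths), so the comparison is indicative rather than exact.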
Let’s test a few more regression models adding more features.
## [1] 0.883
## [1] 0.887
## [1] 0.747
R2 values for adding more features are 0.883, 0.887, and 0.747.
| log(gdp.per.capita) | log(gdp.per.capita):SubRegionCaribbean | log(gdp.per.capita):SubRegionCentralAmerica | log(gdp.per.capita):SubRegionCentralAsia | log(gdp.per.capita):SubRegionEasternAfrica | log(gdp.per.capita):SubRegionEasternAsia | log(gdp.per.capita):SubRegionEasternEurope | log(gdp.per.capita):SubRegionMelanesia | log(gdp.per.capita):SubRegionMicronesia | log(gdp.per.capita):SubRegionMiddleAfrica | log(gdp.per.capita):SubRegionNorthernAfrica | log(gdp.per.capita):SubRegionNORTHERNAMERICA | log(gdp.per.capita):SubRegionNorthernEurope | log(gdp.per.capita):SubRegionPolynesia | log(gdp.per.capita):SubRegionSouth-EasternAsia | log(gdp.per.capita):SubRegionSouthAmerica | log(gdp.per.capita):SubRegionSouthernAfrica | log(gdp.per.capita):SubRegionSouthernAsia | log(gdp.per.capita):SubRegionSouthernEurope | log(gdp.per.capita):SubRegionWesternAfrica | log(gdp.per.capita):SubRegionWesternAsia | log(gdp.per.capita):SubRegionWesternEurope | SubRegionCaribbean | SubRegionCentralAmerica | SubRegionCentralAsia | SubRegionEasternAfrica | SubRegionEasternAsia | SubRegionEasternEurope | SubRegionMelanesia | SubRegionMicronesia | SubRegionMiddleAfrica | SubRegionNorthernAfrica | SubRegionNORTHERNAMERICA | SubRegionNorthernEurope | SubRegionPolynesia | SubRegionSouth-EasternAsia | SubRegionSouthAmerica | SubRegionSouthernAfrica | SubRegionSouthernAsia | SubRegionSouthernEurope | SubRegionWesternAfrica | SubRegionWesternAsia | SubRegionWesternEurope |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 975 | 7945 | 5044 | 3147 | 9101 | 2492 | 5911 | 2988 | 2640 | 5154 | 3892 | 3611 | 5909 | 1944 | 5987 | 7157 | 3933 | 5166 | 7168 | 9144 | 9611 | 5771 | 6747 | 3902 | 2131 | 5646 | 2101 | 4744 | 2281 | 2106 | 3457 | 2936 | 3714 | 5880 | 1584 | 4427 | 5680 | 3037 | 3437 | 6334 | 5707 | 8049 | 5965 |
| log(gdp.per.capita) | log(gdp.per.capita):SubRegionCaribbean | log(gdp.per.capita):SubRegionCentralAmerica | log(gdp.per.capita):SubRegionCentralAsia | log(gdp.per.capita):SubRegionEasternAfrica | log(gdp.per.capita):SubRegionEasternAsia | log(gdp.per.capita):SubRegionEasternEurope | log(gdp.per.capita):SubRegionMelanesia | log(gdp.per.capita):SubRegionMicronesia | log(gdp.per.capita):SubRegionMiddleAfrica | log(gdp.per.capita):SubRegionNorthernAfrica | log(gdp.per.capita):SubRegionNORTHERNAMERICA | log(gdp.per.capita):SubRegionNorthernEurope | log(gdp.per.capita):SubRegionPolynesia | log(gdp.per.capita):SubRegionSouth-EasternAsia | log(gdp.per.capita):SubRegionSouthAmerica | log(gdp.per.capita):SubRegionSouthernAfrica | log(gdp.per.capita):SubRegionSouthernAsia | log(gdp.per.capita):SubRegionSouthernEurope | log(gdp.per.capita):SubRegionWesternAfrica | log(gdp.per.capita):SubRegionWesternAsia | log(gdp.per.capita):SubRegionWesternEurope | SubRegionCaribbean | SubRegionCentralAmerica | SubRegionCentralAsia | SubRegionEasternAfrica | SubRegionEasternAsia | SubRegionEasternEurope | SubRegionMelanesia | SubRegionMicronesia | SubRegionMiddleAfrica | SubRegionNorthernAfrica | SubRegionNORTHERNAMERICA | SubRegionNorthernEurope | SubRegionPolynesia | SubRegionSouth-EasternAsia | SubRegionSouthAmerica | SubRegionSouthernAfrica | SubRegionSouthernAsia | SubRegionSouthernEurope | SubRegionWesternAfrica | SubRegionWesternAsia | SubRegionWesternEurope | Year1991 | Year1992 | Year1993 | Year1994 | Year1995 | Year1996 | Year1997 | Year1998 | Year1999 | Year2000 | Year2001 | Year2002 | Year2003 | Year2004 | Year2005 | Year2006 | Year2007 | Year2008 | Year2009 | Year2010 | Year2011 | Year2012 | Year2013 | Year2014 | Year2015 | Year2016 | Year2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 987 | 8010 | 5074 | 3170 | 9188 | 2516 | 5949 | 2997 | 2662 | 5201 | 3917 | 3613 | 5952 | 1951 | 6051 | 7192 | 3955 | 5204 | 7232 | 9195 | 9704 | 5774 | 1.93 | 1.94 | 1.95 | 1.96 | 1.99 | 1.99 | 1.99 | 1.99 | 1.99 | 2.02 | 2.02 | 2.05 | 2.05 | 2.06 | 2.07 | 2.08 | 2.09 | 2.11 | 2.1 | 2.11 | 2.12 | 2.12 | 2.13 | 2.13 | 2.11 | 2.1 | 2.11 | 6800 | 3922 | 2144 | 5695 | 2121 | 4771 | 2285 | 2122 | 3486 | 2952 | 3717 | 5922 | 1587 | 4473 | 5704 | 3051 | 3458 | 6390 | 5730 | 8125 | 5967 |
| log(gdp.per.capita) | log(gdp.per.capita):Year1991 | log(gdp.per.capita):Year1992 | log(gdp.per.capita):Year1993 | log(gdp.per.capita):Year1994 | log(gdp.per.capita):Year1995 | log(gdp.per.capita):Year1996 | log(gdp.per.capita):Year1997 | log(gdp.per.capita):Year1998 | log(gdp.per.capita):Year1999 | log(gdp.per.capita):Year2000 | log(gdp.per.capita):Year2001 | log(gdp.per.capita):Year2002 | log(gdp.per.capita):Year2003 | log(gdp.per.capita):Year2004 | log(gdp.per.capita):Year2005 | log(gdp.per.capita):Year2006 | log(gdp.per.capita):Year2007 | log(gdp.per.capita):Year2008 | log(gdp.per.capita):Year2009 | log(gdp.per.capita):Year2010 | log(gdp.per.capita):Year2011 | log(gdp.per.capita):Year2012 | log(gdp.per.capita):Year2013 | log(gdp.per.capita):Year2014 | log(gdp.per.capita):Year2015 | log(gdp.per.capita):Year2016 | log(gdp.per.capita):Year2017 | Year1991 | Year1992 | Year1993 | Year1994 | Year1995 | Year1996 | Year1997 | Year1998 | Year1999 | Year2000 | Year2001 | Year2002 | Year2003 | Year2004 | Year2005 | Year2006 | Year2007 | Year2008 | Year2009 | Year2010 | Year2011 | Year2012 | Year2013 | Year2014 | Year2015 | Year2016 | Year2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34.7 | 187 | 179 | 181 | 177 | 183 | 185 | 186 | 185 | 183 | 185 | 186 | 187 | 187 | 189 | 191 | 194 | 197 | 202 | 209 | 213 | 216 | 219 | 221 | 223 | 224 | 222 | 222 | 187 | 180 | 181 | 177 | 184 | 187 | 189 | 187 | 185 | 187 | 188 | 190 | 192 | 196 | 200 | 205 | 210 | 217 | 223 | 228 | 232 | 236 | 239 | 240 | 240 | 238 | 238 |
Although adding more features to our regression model results in higher R2 values, the Variance Inflation Factors (VIF) for each of those models are extremely high, which indicates that the added features are highly correlated with each other, so we reject those models. Therefore, we will stick with our second model, fit2.
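For readers unfamiliar with VIF, the sketch below computes it by hand in base R: each design-matrix column is regressed on all the others, and VIF = 1 / (1 - R2) of that regression. This is an illustration under assumptions, not necessarily the function used to produce the tables above; `toy_df` and its three-level `SubRegion` factor are simulated.

```r
# Simulated data with a hypothetical three-level SubRegion factor.
set.seed(4)
toy_df <- data.frame(
  gdp.per.capita = exp(runif(300, log(500), log(60000))),
  SubRegion = factor(sample(c("A", "B", "C"), 300, replace = TRUE))
)

# Design matrix for log(gdp.per.capita) * SubRegion, dropping the intercept.
X <- model.matrix(~ log(gdp.per.capita) * SubRegion, data = toy_df)[, -1]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest.
vif_by_hand <- sapply(seq_len(ncol(X)), function(j) {
  r2_j <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2_j)
})
names(vif_by_hand) <- colnames(X)

print(round(vif_by_hand, 1))
```

A VIF of 1 means a column is uncorrelated with the others; values far above common rules of thumb (often 5 or 10) signal strong multicollinearity.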
We can then predict a country’s deaths per 100k caused by air pollution in a given year from the country’s GDP per capita with the following equation. Since log() in R is the natural logarithm (ln), the inverse transformation uses e rather than 10:
\[ \ln(Deaths_{per~100k|per~year|per~country}) = 10.07849 - 0.38952 * \ln(GDP_{per~capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]
or
\[ Deaths_{per~100k|per~year|per~country} = e^{10.07849 - 0.38952 * \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]
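As a worked example, the fitted coefficients from the summary output can be turned into a small prediction function. This sketch assumes the response is deaths per 100k and that the logs are natural logs, since that is what R's `log()` computes; the function names and the input GDP values are hypothetical.

```r
# Coefficients reported in the summary output of the log-log fit.
intercept <- 10.07849
slope <- -0.38952

# Predicted deaths per 100k for a given GDP per capita.
predict_deaths_per_100k <- function(gdp_per_capita) {
  exp(intercept + slope * log(gdp_per_capita))
}

# Equivalent power-law form: exp(intercept) * GDP^slope.
predict_power_law <- function(gdp_per_capita) {
  exp(intercept) * gdp_per_capita^slope
}

gdp_values <- c(1000, 10000, 50000)  # illustrative inputs, not from the data
print(predict_deaths_per_100k(gdp_values))

# Ratio of predictions when GDP per capita doubles (same at every GDP level).
doubling_ratio <- predict_deaths_per_100k(2000) / predict_deaths_per_100k(1000)
print(doubling_ratio)
```

Because the model is a power law, doubling GDP per capita always multiplies the predicted deaths per 100k by 2^(-0.38952), which is roughly a 24% reduction.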
Is there a difference in mean deaths caused by air pollution between the lowest, low, medium, and high GDP per capita levels?
Let’s test whether the mean deaths caused by air pollution per 100k differ across the GDP per capita levels.
H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal
Use \(\alpha\) of 0.05.
The p-value of this test is reported as 0e+00 (too small to be distinguished from zero), which is lower than \(\alpha\) of 0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means there is statistically significant evidence that at least one of the mean deaths across the lowest, low, medium, and high GDP per capita levels differs from the others.
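A test of this kind can be sketched as a one-way ANOVA with `aov()`. The report does not state which function or which GDP cut-offs were used, so the quartile binning, the `gdp_level` column, and the simulated `toy_df` below are all assumptions.

```r
# Simulated stand-in data; values are illustrative only.
set.seed(6)
toy_df <- data.frame(gdp.per.capita = exp(runif(400, log(500), log(60000))))
toy_df$Deaths.Air.Pollution.per.100k <-
  exp(10 - 0.4 * log(toy_df$gdp.per.capita) + rnorm(400, sd = 0.3))

# Hypothetical binning: quartiles of GDP per capita define the four levels.
breaks <- quantile(toy_df$gdp.per.capita, probs = seq(0, 1, 0.25))
toy_df$gdp_level <- cut(toy_df$gdp.per.capita, breaks = breaks,
                        labels = c("lowest", "low", "medium", "high"),
                        include.lowest = TRUE)

# One-way ANOVA: do mean deaths per 100k differ across the four levels?
anova_fit <- aov(Deaths.Air.Pollution.per.100k ~ gdp_level, data = toy_df)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

alpha <- 0.05
cat("ANOVA p-value:", format(anova_p, digits = 3), "\n")
cat("Reject H0:", anova_p < alpha, "\n")
```

A significant result only says that at least one group mean differs; pairwise tests are still needed to say which ones.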
We will conduct four two-sample t-tests:
- Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
  - H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
  - H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
- Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
  - H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp
  - H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_medium_gdp
- Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
  - H0: \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
  - H1: \(\mu\)deaths_medium_gdp != \(\mu\)deaths_high_gdp
- Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
  - H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_high_gdp
  - H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_high_gdp
Each test is a two-sample t-test with \(\alpha\) of 0.05.
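The four comparisons can be sketched with `t.test()`, which by default runs a two-sided Welch test that does not assume equal variances. The group means, sample sizes, and the `deaths_by_level` list below are made up for illustration and are not the report's data.

```r
# Simulated deaths per 100k for four hypothetical GDP levels (made-up means).
set.seed(7)
group_means <- c(lowest = 900, low = 600, medium = 400, high = 250)
deaths_by_level <- lapply(group_means, function(m) rnorm(120, mean = m, sd = 150))

# The four pairs compared in the hypotheses.
pairs_to_test <- list(c("lowest", "low"), c("low", "medium"),
                      c("medium", "high"), c("lowest", "high"))

alpha <- 0.05
# Run one two-sample t-test per pair and keep the p-values.
p_values <- sapply(pairs_to_test, function(p) {
  t.test(deaths_by_level[[p[1]]], deaths_by_level[[p[2]]])$p.value
})
names(p_values) <- sapply(pairs_to_test, paste, collapse = " vs ")

print(p_values)
print(p_values < alpha)  # TRUE means H0 is rejected for that pair
```

When several pairwise tests are run on the same data, a multiple-comparison adjustment such as `p.adjust(p_values, method = "bonferroni")` is a common safeguard.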
Test 1:
p-valuetest1: 2.99e-203
p-valuetest1 < \(\alpha\)0.05 = TRUE
Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.
Test 2:
p-valuetest2: 1.47e-13
p-valuetest2 < \(\alpha\)0.05 = TRUE
Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.
Test 3:
p-valuetest3: 0e+00
p-valuetest3 < \(\alpha\)0.05 = TRUE
Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Test 4:
p-valuetest4: 2.91e-06
p-valuetest4 < \(\alpha\)0.05 = TRUE
Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
6. Conclusion
From all of our tests, we can confirm that the mean deaths caused by air pollution differ significantly across the levels of GDP per capita. This reinforces the idea that deaths caused by air pollution have a significant relationship with GDP per capita, and that relationship can be quantified by Equation 2, where the logarithm is the natural logarithm used by R’s log():
\[ Deaths_{per~100k|per~year|per~country} = e^{10.07849 - 0.38952 * \ln(GDP_{per~capita})} ~~~~~~~~~~~~~~~~ eqn (2) \]